Hybrid transformer-CNN model for accurate prediction of peptide hemolytic potential

Hemolysis is a crucial factor in various biomedical and pharmaceutical contexts, driving our interest in developing advanced computational techniques for precise prediction. Our proposed approach takes advantage of the unique capabilities of convolutional neural networks (CNNs) and transformers to detect complex patterns inherent in the data. The integration of CNN and transformers' attention mechanisms allows for the extraction of relevant information, leading to accurate predictions of hemolytic potential. The proposed method was trained on three distinct data sets of peptide sequences known as recurrent neural network-hemolytic (RNN-Hem), Hlppredfuse, and Combined. Our computational results demonstrated the superior efficacy of our models compared to existing methods. The proposed approach demonstrated impressive Matthews correlation coefficients of 0.5962, 0.9111, and 0.7788 respectively, indicating its effectiveness in predicting hemolytic activity. With its potential to guide experimental efforts in peptide design and drug development, this method holds great promise for practical applications. Integrating CNNs and transformers proves to be a powerful tool in the fields of bioinformatics and therapeutic research, highlighting their potential to drive advancement in this area.

Although traditional methods offer valuable information, they have restrictions in terms of scalability, efficiency, and the ability to understand complex sequence-structure relationships.As a result, researchers have increasingly relied on computational methods to enhance and streamline prediction.Numerous computational strategies have been investigated, including machine learning algorithms and advanced deep learning architectures.Machine learning models 12 , including support vector machines (SVM) and random forests, have been applied to predict hemolytic potential based on peptide sequences [12][13][14][15][16][17] .Deep learning models including RNN and transfer learning models, were used 18,19 .Although these methods have shown considerable predictive abilities, their effectiveness is highly dependent on the specific features chosen and may not fully capture intricate connections within peptide sequences.However, in recent times, deep learning techniques such as CNNs and transformers have emerged as powerful tools for automatically extracting hierarchical characteristics and comprehending long-range relationships in sequences 20 .Using these architectures, we can potentially improve the precision and speed in predicting hemolytic activity.The specialized design of CNNs allows effective detection of local patterns, while the innovative use of attention mechanisms in transformers enables the identification of broader connections within sequences 19,21 .The proposed approach was built on this literature to contribute a novel perspective to predicting hemolytic activity.This synergistic combination enables our model to learn complex sequencestructure relationships with exceptional accuracy, exceeding the limitations of previous methods.The critical insights drawn from existing literature guide our methodology, laying the groundwork for a comprehensive and innovative approach to predicting hemolytic activity in peptides.Critically, theoretical modeling approaches based on ordinary differential equations (ODEs) have been instrumental in predicting diseases and deciphering intricate biological processes.Studies utilizing ODE-based theoretical modeling, such as those referenced [22][23][24] provide valuable insights into dynamic systems and can complement our computational framework for predicting hemolytic activity.By incorporating these theoretical modeling paradigms into our discussion, we aim to not only enhance the depth of our analysis but also highlight future research directions.The integration of computational methods with theoretical modeling promises to further advance our understanding of hemolysis in peptides, ultimately contributing to more effective drug design and therapeutic strategies.For more details on the identification of peptides using mathematical models, the reader can refer to DiMaggio et al. 25 .On the other hand, for more details on how to formulate real-world problems as mathematical models, the reader is referred to Badr et al. [26][27][28] .
The advancement of interaction prediction research in computational biology, particularly the use of graph neural networks (GNNs) for miRNA-lncRNA interaction prediction, has provided valuable insights into genetic markers and non-coding RNAs.It is essential to cite pivotal computational models in this domain, such as those detailed in studies [29][30][31][32][33][34][35][36] , which have contributed significantly to the field.Furthermore, acknowledging the progress in interaction prediction research across various computational biology domains is vital.These studies offer valuable insights into genetic markers and associated diseases, underscoring the importance of referencing key computational models within these domains.Relevant studies 37,38 should be included to highlight the advancements and contributions to the field.
In this paper, our structure is as follows: after this introduction, we will dive into the methodology we utilized to construct and train our predictive models, exploring the reasoning behind incorporating CNNs and transformers.We will then discuss our results and evaluate their performance.Finally, we emphasize the importance of our research and suggest potential avenues for further advancement in predictive modeling for peptide design and biomedical applications.

Data and methods
In this section, we provide a detailed overview of the datasets used and the methodology used in our study to predict hemolytic activity in peptides utilizing CNNs and transformers.

Data collection
Our research uses a variety of datasets, ensuring that our predictive models are accessible and widely applicable.The main datasets utilized in this investigation comprise RNN-Hem 18 , Hlppredfuse 12 , and Combined 19 .These datasets incorporate a diverse set of peptide sequences with documented hemolytic activities that serve as the basis for the development, validation, and testing phases of our models.
Table 1 presents a comprehensive overview of the datasets utilized in our research, emphasizing their distinct sources and composition of positive (hemolytic) and negative sets (non-hemolytic).The datasets, namely RNN-Hem, Hlppredfuse, and Combined, have been curated from reputable sources in the field.Each dataset contributes to the diversity of our study by incorporating a wide range of peptide sequences with documented hemolytic activities.RNN-Hem Sourced from Capecchi et al. 18 , this dataset includes 1359 instances in the positive set and 1198 instances in the negative set.Hlppredfuse 12 obtained from Hasan et al. 12  1096 instances in the positive set and 2422 instances in the negative set.Combined with an extract from Salem et al. 19 , this data set incorporates 3007 instances in the positive set and 4172 instances in the negative set.These datasets are crucial for the success of our model development process.By incorporating a variety of sources and a large number of instances, our predictive models can utilize a diverse and comprehensive sample.This improves their strength and ability to be applied in various situations.In the following sections, we will discuss in detail the techniques used in handling and harnessing these datasets for training and assessing our models.

Data representation
The way we represent peptide sequences profoundly influences the ability of deep-learning models to unlock their hemolytic potential.Automated representation based on deep learning of biological sequences is effective while saving time and effort in traditional methods of gathering information 39 .A thoughtfully designed numerical representation not only captures the essence of each amino acid but also cultivates a structured landscape where patterns of hemolytic activity can emerge.In this pursuit, we embarked on decoding the hidden language of peptides, carefully crafting a representation that enables our models to delve into the depths of peptide sequences and illuminate their hidden relationships with hemolysis.Each peptide sequence was segmented into its fundamental amino acid units, creating a vocabulary of 20 distinct amino acid symbols.Each amino acid token was assigned a unique numerical index, effectively translating the symbolic sequence into a numerical format suitable for computational processing.To maintain consistency in input dimensions for deep learning models, we padded sequences with zeros up to a fixed maximum length of 50.Given that most of the peptides in our datasets possess lengths below 50, we opt for this maximum length to efficiently represent the majority of sequences while maintaining sufficient capacity for potential long-range dependencies within this range.This ensures a uniform input structure, even with varying sequence lengths.Through this carefully designed numerical representation, we transformed the raw peptide sequences into a structured format that empowers our deep learning models to uncover the intricate relationships between amino acid composition and hemolytic potential.
As shown in Fig. 1, each amino acid within the peptide sequence (LAEWNAE) is transformed into a unique numerical index.For example, the first amino acid L is represented as 5.This encoding preserves the distinct identity of each amino acid while facilitating efficient processing by deep learning models.By padding shorter sequences with zeros up to a maximum length of 50 (as shown in the figure), we ensure a consistent input format regardless of the peptide's actual length, enabling the models to focus on the relevant sequence patterns.

Methodology
The intricacies of peptide hemolysis are analogous to deciphering a complex puzzle, where individual amino acids serve as the pieces and their arrangement dictates the hemolytic potential.In this effort, we constructed a deep learning architecture that seamlessly integrates local and global analyses, as shown in Fig. 2, harnessing the complementary strengths of CNNs and transformer-based attention mechanisms.
At first, CNNs play a pivotal role as they meticulously scan the peptide sequence.They diligently detect recurring patterns, examine the bonds between adjacent amino acids, and unravel the close-range connections that contribute to the fundamental components of hemolytic activity.Similarly to recognizing familiar melodies, CNNs establish a solid understanding of how local collaborations shape the initial characteristics of the hemolytic profile.However, the complexity of the melody goes beyond these immediate harmonies.Here, the Transformers take the lead.With attention mechanisms that span the entire sequence, they carefully study the subtle relationships between distant amino acids.This global perspective unveils long-range collaborations that can enhance or mitigate the hemolytic tendencies established by local motifs.These previously overlooked connections now become integral components, enriching the model's understanding of the peptide's overall hemolytic potential.The synergy between local analysis and global exploration is fundamental to the power of our architecture.The insights obtained, whether short-range motifs identified by CNNs or long-range connections revealed by transformers undergo meticulous processing by dedicated feed forward networks 40 .These processes delicately shape the raw data, providing an all-encompassing description of the subtle mechanisms that drive hemolytic activity.It is like extracting the very essence of a puzzle, capturing every subtle detail and interplay that forms the hemolysis profile.
Throughout the training process, we carefully select a specific set of hyperparameters to enhance the performance of our model.These hyperparameters consist of pre-training adjustments that impact the behavior of the model but are not acquired during the training itself.In this context, some noteworthy examples of hyperparameters include the quantity and dimensions of filters utilized in the convolutional layers, the size of pooling windows employed in the pooling layers, the number of neurons within the fully connected layers, the optimizer's learning rate, the duration of time to train the model, as well as the batch size.For further details, please refer to Table 2 which outlines the specific hyperparameters used during the training process.

Software and hardware
The development and execution of machine learning models were carried out seamlessly using a comprehensive set of software and hardware resources.Python (3.10) emerged as the primary programming language, supported by essential libraries such as Pandas, NumPy, Matplotlib, and scikit-learn for data manipulation, analysis, and visualization.Deep learning models were implemented with TensorFlow (2.13.0).In terms of hardware, Kaggle computational resources, including GPU (GPU T4 ×2) capabilities, were used for model training and evaluation.

Model evaluation
To evaluate the performance of the hybrid Transformer-CNN model, we used accuracy (Acc), precision, recall, Area under the ROC Curve (ROC-AUC), and Matthews correlation coefficient (MCC) 41 .The evaluation metrics are defined in the following equations: (1) Acc = TP + TN TP + TN + FP + FN × 100

Results
In this section, we present the performance metrics of our proposed hybrid transformer-CNN architecture model across three distinct datasets.RNN-Hem 18 , Hlppredfuse, and Combined.Furthermore, a comprehensive comparative analysis with previously used methods further elucidates the efficacy of our model.Table 3 shows that the model achieved substantial accuracy (79.69%), precision (82.93%), recall (76.69%),ROC-AUC (0.861), and MCC (0.5962) in the RNN-Hem dataset, indicating its ability to identify hemolytic activity within peptide sequences.Demonstrated exceptional performance with high accuracy (96.16%), precision (93.27%), recall (94.55%),ROC-AUC (0.976), and MCC (0.9111) in the Hlppredfuse 12 dataset, showcasing the robustness of the model in predicting hemolytic potential.The Combined dataset displayed commendable metrics, with notable accuracy (89.28%), precision (87.59%), recall (86.41%),ROC-AUC (0.942), and MCC (0.7788), highlighting the consistency of the model in various datasets.The hybrid transformer-CNN architecture model consistently exhibits strong predictive capabilities across varied datasets, underscoring its versatility and effectiveness in accurately predicting hemolytic potential in peptides.
In Table 4, our proposed hybrid transformer-CNN architecture model exhibited competitive or superior metrics in the RNN-Hem dataset 18 , showcasing its effectiveness in achieving comparable or even better predictive performance compared to established classifiers.AMPDeep 19 demonstrated competitive accuracy, precision, recall, ROC-AUC, and MCC, positioning itself as a strong contender against traditional classifiers.Existing classifiers, namely SVM-Hem 18 , RF-Hem 18 , and RNN-Hem 18 , achieved moderate performance but were surpassed by the proposed model and AMPDeep 19 .Moving to Table 5, our proposed model outperformed existing classifiers in the Hlppred-Fuse dataset 12 in terms of accuracy, precision, recall, ROC-AUC, and MCC, highlighting its efficacy  www.nature.com/scientificreports/ in accurately predicting hemolytic potential.Although AMPDeep 19 showed strong performance metrics, the proposed model surpassed it in multiple evaluation criteria.The existing classifiers exhibited varied performance, underscoring the superiority of our proposed model in predicting hemolytic activity.In Table 6, the proposed model demonstrated better accuracy, precision, recall, ROC-AUC, and MCC compared to existing classifiers in the Combined data set 19 , indicating its robustness in predicting hemolytic potential.Although AMPDeep 19 showed competitive performance, the proposed model outperformed it in multiple evaluation metrics.This collective evidence underscores the consistent effectiveness of our proposed hybrid transformer-CNN architecture model in predicting hemolytic activity across interdisciplinary datasets, positioning it as a powerful and versatile tool in computational biology.Table 7 presents the performance metrics of the model without the CNNs module across three datasets: RNN-Hem, Hlppredfuse, and AMP-Combined.This table is crucial for understanding the impact of the CNNs module on the overall model performance and identifying its supportive contribution.In the RNN-Hem dataset, removing the CNNs module led to a decrease in accuracy from 79.69 to 74.02%, precision from 82.93 to 79.04%, recall from 76.69 to 68.05%, ROC-AUC from 0.861 to 0.7424, and MCC from 0.5962 to 0.4877.Similarly, in the Hlppredfuse dataset, the model without CNNs showed reduced performance in accuracy, precision, recall, ROC-AUC, and MCC compared to the full model.The AMP-Combined dataset also exhibited lower metrics without the CNNs module, indicating its significant contribution to the model's predictive capabilities across different datasets.
To examine the model learning process, we visualized its accuracy and loss curves in the three data sets, as shown in Fig. 3.In particular, the accuracy curves for all datasets exhibited a consistent upward trend, indicating successful learning and convergence towards optimal performance.This pattern was particularly evident for the Hlppredfuse dataset, where the model achieved remarkable accuracy during training.Loss curves showed a steady downward trajectory, reflecting a gradual reduction in prediction errors as training progressed.This decline was particularly pronounced for the AMP-Combined dataset, demonstrating efficient error minimization.Collectively, these curves affirm the model's ability to effectively learn from the training data and refine its predictive capabilities over time.This robust learning behavior underpins the model's exceptional performance in predicting peptide hemolytic activity.This decline was particularly pronounced for the Combined dataset, demonstrating efficient error minimization.Collectively, these curves affirm the model's ability to effectively learn from the training data and refine its predictive capabilities over time.This robust learning behavior underpins the model's exceptional performance in predicting peptide hemolytic activity.
The training process is a critical aspect of model development and influences both the time required for convergence and the complexity of the trained model.Table 8 provides information on the training time for each dataset and the corresponding number of trainable parameters in the proposed hybrid transformer-CNN model.The proposed model comprises a total of 11,748,097 trainable parameters, indicating the complexity of the neural network architecture.This parameter count encompasses the weights and biases in the convolutional and transformer layers, as well as the fully connected layers, contributing to the model's ability to capture intricate patterns within peptide sequences.

Conclusions
In conclusion, our research presents an innovative computational method for forecasting the hemolytic potential of peptides.By combining the strengths of CNNs and transformer-based attention mechanisms, our hybrid transformer-CNN model can detect complex patterns within peptide sequences.This results in highly accurate predictions of hemolytic activity.Our model's success can be seen in its performance on various datasets, such as RNN-Hem, Hlppredfuse, and Combined.The proposed method achieved the highest prediction accuracy with Matthews's correlation coefficients of 0.5962, 0.9111, and 0.7788 on these datasets, respectively.Comparative analyses highlight the competitive or superior performance of our hybrid Transformer-CNN architecture model compared to existing classifiers.Across the RNN-Hem, Hlppredfuse, and Combined datasets, our model outperforms or matches the performance of established methods, demonstrating its effectiveness in addressing the challenges associated with predicting hemolytic potential.Despite these successes, our model has limitations that must be considered.The model's performance is heavily dependent on the quality and diversity of the training datasets.The current datasets may not cover all possible peptide variations, potentially affecting the model's generalizability.The computational intensity required for training and optimizing the model may not be accessible to all researchers, given the need for high-performance GPUs and substantial memory capacity.The complexity of the model poses challenges in interpretability, the predictions generated by the model need to be experimentally validated to confirm their accuracy and reliability in real-world scenarios.Future research could  explore the extension of our model to additional datasets, further validating its generalizability.Additionally, finetuning the model's hyperparameters and exploring different architectural configurations may offer opportunities for refinement and improvement.Our work sets the stage for continued advancements in predictive modeling of hemolytic activity, with potential implications for the broader fields of bioinformatics and drug discovery.Finally, partially ordered sets can be used according to their effect on red blood cell hemolysis, presenting a promising direction for future investigations.

Figure 1 .
Figure 1.encoding applied to the peptide sequence.

Figure 2 .
Figure 2. Hybrid transformer-CNN architecture for predicting hemolytic activity of peptides.

Table 1 .
, this data set comprises Overview of data sets used in the study.

Table 2 .
Parameter settings for the proposed model.

Table 3 .
Performance of the proposed model in the three data sets.

Table 4 .
Comparison of the proposed model with the previous methods in the RNN-Hem dataset.Bold values are for the best-performing model.

Table 5 .
Comparison of the proposed model with the previous methods in the HLPpred-Fuse dataset.Bold values are for the best performing model.

Table 6 .
Comparison of the proposed model with the previous methods in the AMP-Combined dataset.Bold values are for the best performing model.

Table 7 .
Performance of the model without CNN in the three data sets.

Table 8 .
Training time and training parameters that were associated with the proposed model.